The data set contains 1599 observations and 13 numeric variables. The first note of caution comes from the skew and kurtosis columns in the output below. Both measures provide invaluable clues about the shape of a distribution: the former concerns symmetry, whereas the latter focuses on the tailedness of the data. Applying the conventional benchmarks points to prevailing right-skewness in the univariate distributions of five of the components; four of them are heavily leptokurtic, with one exception. As a consequence, a propensity for outliers must be reckoned with; however, without further information about the nature of the collected observations, removing them might discard noteworthy wines within a given cluster. Outliers will therefore remain present throughout all subsequent analyses.
# Load the data
library(readr)
library(psych)
library(knitr)
data <- read_csv("wineQualityReds.csv")
# headTail(data)
obj <- describe(data[, 2:13])  # summary statistics, incl. skew and kurtosis (psych)
kable(obj, caption = "Table 1. Summary Statistics", digits = 3)
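The skew and kurtosis figures in Table 1 can be sanity-checked from first principles. A minimal sketch on a toy right-skewed vector (base R only; `psych::describe()` applies a small-sample adjustment by default, so its values differ slightly):

```r
# Moment-based skewness and excess kurtosis, computed by hand.
# Toy right-skewed vector; illustrative only, not the wine data.
x <- c(1, 2, 2, 3, 3, 3, 10)
m <- mean(x)
s <- sqrt(mean((x - m)^2))         # population standard deviation
skew <- mean((x - m)^3) / s^3      # > 0 indicates right skew
kurt <- mean((x - m)^4) / s^4 - 3  # excess kurtosis; > 0 is leptokurtic
c(skew = skew, kurtosis = kurt)
```

The single large value of 10 pulls both measures above zero, which is exactly the pattern flagged in the wine attributes.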
Assessing Normality
The size of the following display is proportional to the amount of information it conveys: with the exception of density, pH, and quality, the univariate distributions are highly right-skewed. A short statement, but one with major implications about the need either to normalize the data or to draw upon non-parametric tests.
# Assessing normality: shape of each distribution
# class(data) # class: tbl_df, tbl, data.frame
data_1 <- na.omit(as.data.frame(data[, 2:13]))  # plotNormalHistogram() expects a plain data.frame
# class(data_1) # class: data.frame
par(mfrow = c(3, 4))
for (i in 1:12) {
  plotNormalHistogram(data_1[, i], main = names(data_1)[i])  # rcompanion
}
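The visual impression can be complemented by a formal test not used in the original report. A sketch applying the Shapiro-Wilk test (base R) to each column, assuming the `data_1` object built in the chunk above:

```r
# Shapiro-Wilk normality test per variable (H0: the sample is normal).
# Very small p-values agree with the right skew seen in the histograms.
sw_p <- sapply(data_1, function(x) shapiro.test(x)$p.value)
round(sort(sw_p), 4)
```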
Assessing Collinearity
The purpose of the correlation chart is straightforward: to detect variables carrying the same information about the dependent variable. Besides implying redundancy, collinearity among the explanatory variables tends to produce a less precise model than if the predictors were uncorrelated. Once more, an R function makes it extremely easy to capture such a vital insight. The stars represent the significance levels.
# Assessing collinearity
library(PerformanceAnalytics)
chart.Correlation(data[, 2:13], method = "pearson", histogram = TRUE, pch = 16)  # default method is pearson
We know by now that the variables chlorides and alcohol would not pair well in a regression model, nor would those sharing sulfur dioxide. Residual sugar might work well with alcohol or chlorides, but not with the sulfur variables. If the goal of the current assignment were statistical modeling rather than merely exploratory data analysis, interactions among the mentioned predictors should be quantified.
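The pairwise picture can be condensed into a single collinearity diagnostic per predictor. A sketch using variance inflation factors via `car::vif()` (the car package is an assumption here; it is not loaded elsewhere in this report), applied to the `data_1` data frame built earlier:

```r
library(car)  # assumption: car is installed; supplies vif()
fit <- lm(quality ~ ., data = data_1)  # quality is the last column of data_1
vif(fit)  # rule of thumb: values above ~5-10 flag problematic collinearity
```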
Mean Comparison across Ratings
The next step is to understand how a given explanatory variable behaves across the different ratings. For an attribute to be a meaningful predictor, mapping it across the qualifications should not trace a flat line; rather, a significant increase or decrease should appear along the line. To this end, the data has been transformed to a long format and grouped by attribute and rating. The abbreviated output is as follows:
d_feature <- data %>%
  gather(profile, value, 2:12) %>%
  mutate(Rating = factor(quality, levels = unique(quality)),
         Attribute = factor(profile, levels = unique(profile))) %>%
  select(-X1, -quality, -profile) %>%
  group_by(Rating, Attribute)
# headTail(d_feature)
# levels(d_feature$Attribute)
s_feature <- Summarize(value ~ Rating + Attribute, data = d_feature)  # FSA
kable(head(s_feature, 12)[, c(1, 2, 3, 5, 6)], caption = "Table 2. Long Data")
Table 2. Long Data

| Rating | Attribute        | n   | mean      | sd        |
|--------|------------------|-----|-----------|-----------|
| 5      | fixed.acidity    | 681 | 8.1672540 | 1.5639880 |
| 6      | fixed.acidity    | 638 | 8.3471787 | 1.7978488 |
| 7      | fixed.acidity    | 199 | 8.8723618 | 1.9924833 |
| 4      | fixed.acidity    | 53  | 7.7792453 | 1.6266245 |
| 8      | fixed.acidity    | 18  | 8.5666667 | 2.1196559 |
| 3      | fixed.acidity    | 10  | 8.3600000 | 1.7708755 |
| 5      | volatile.acidity | 681 | 0.5770411 | 0.1648012 |
| 6      | volatile.acidity | 638 | 0.4974843 | 0.1609623 |
| 7      | volatile.acidity | 199 | 0.4039196 | 0.1452244 |
| 4      | volatile.acidity | 53  | 0.6939623 | 0.2201100 |
| 8      | volatile.acidity | 18  | 0.4233333 | 0.1449138 |
| 3      | volatile.acidity | 10  | 0.8845000 | 0.3312556 |
Seeing the data, two caveats immediately emerge:

1. Is the mean the appropriate measure of central tendency for this data set? Based on Figure 1 (the normal-histogram plots), a more robust measure would be a better choice.
2. As the last histogram of that display shows, the sample sizes of ratings 3 and 8 differ substantially from those of ratings 5 and 6.

For the assignment's sake, the mean will be the measure of choice, and later on adequate techniques will be employed to lessen the uncertainty arising from the sample sizes.
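For reference, the more robust alternative mentioned in the first caveat is a one-line change on the grouped long data. A sketch, assuming the `d_feature` object from the chunk above and dplyr's `summarise()`:

```r
# Group medians as a robust counterpart to Table 2's means.
m_feature <- d_feature %>%
  summarise(n = n(), median = median(value))
head(m_feature, 6)
```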
p_fm <- ggplot(s_feature, aes(Rating, mean, color = Rating, group = 1)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ Attribute) +
  ggtitle("Mean Comparison by Attribute across Rating") +
  theme_stata(scheme = "s2mono") +  # ggthemes
  scale_colour_stata("mono")
p_fm
At a glance, differences are observable in total.sulfur.dioxide, free.sulfur.dioxide, fixed.acidity, and alcohol, in decreasing order. The remaining potential explanatory variables present, at sight, little or no variability across the different ratings, a disappointing fact, as they would hardly account for good predictions of the dependent variable.
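Finally, the non-parametric route mentioned earlier can put a number on these visual differences. A sketch applying the Kruskal-Wallis rank-sum test (base R) to each attribute, again assuming the `data_1` object built in the normality chunk:

```r
# Kruskal-Wallis test of each attribute across the quality ratings
# (H0: all ratings share the same distribution of the attribute).
kw_p <- sapply(names(data_1)[1:11], function(v)
  kruskal.test(data_1[[v]], factor(data_1$quality))$p.value)
round(sort(kw_p), 4)
```

Attributes with small p-values are those whose lines in the mean-comparison plot are clearly not flat.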